# Linux kernel flame graphs

This post is a nothing-burger. All I wanted was to have those words together on the Internet: "Linux kernel flame graph profiling", because I spent about an hour looking for something I already knew would take 2 minutes to execute.

So, to whoever comes to ask the same question as I did back then, the solution is not `perf`, it' also not `trace-cmd`, but instead it's eBPF-based. It's bcc. Read the description of the profile tool. TL;DR:

```
./bcc/tools/profile.py -dfK > profile.flame
flamegraph.pl < profile.flame > flame.svg
```

Conveniently, Fedora has both flamegraph and the bcc library (not the tools) in the repos already.

This is what comes out the other end:

Flame graph showing multiple programs including Web, Xorg, plasmashell. There are 2 large spikes named `idle_inj`, and two spikes called `kworker`, with `btrfs_preempt_reclaim_metadata_space` up on the stack. Those two take up almost 50% width

As you can see, my system is spending *a lot* of time in the kernel, messing with btrfs. Then it overheats, causing it to spend *even more* time in the kernel, due to idle injection. Then, it gets really sluggish, making the operator look for kernel profiling tools, and the operator finally writes a blog post about this.

The bug looks like this report.

## Nothing-burger

If you thought I had something valuable to say, then… Well okay, maybe I shouldn't write blog posts if they are useless to anyone but me. You win. Give me a moment, I'll come up with something insightful to say.


Maybe you're not familiar with flame graphs, but they are a really nice way to check for performance issues. They show the amount of time the resource (in this case the CPU) is busy with each task. The width of the graph is 100% time, and the width of the different bars is the portion of time occupied by something.

This flame graph does not distinguish between different runs of the same task. If we were monitoring tasks done by Joe over the day, and he'd spend the morning slicing potatoes, then chopped wood, and then sliced potatoes again, this kind of a graph would only distinguish two tasks: slicing potatoes and chopping wood.

This makes sense when you consider why the bars are stacked vertically: each bar on top of another is a sub-task. If a task is composed of a hundred sub-tasks, presenting them together makes the picture clear. When we care about optimization, then it's useful to know how the total time spent peeling compares to the time spent cutting. The times for each potato aren't so relevant. It doesn't even matter a lot how we order the results on the left-right axis!

It's especially relevant for computers, where flame graphs may be used to get an idea how much time is spent in a procedure that may be called a thousand times. While the graph won't say how many times it was executed, it will show where it was called from.

## Bonus

This is about the kernel, which is notoriously hard to debug, so let me share my most useful snippets for debugging. Those use trace-cmd, and give extra information without the need to recompile.

  1. Prints the names of functions (with some exceptions like inlining) along the execution path:
```
sudo trace-cmd record -p function_graph -g start_procedure -F ./program_under_test
```
  1. Prints the call stack whenever the chosen procedure is executed:
```
sudo trace-cmd record -p function_graph -g procedure_under_test -n printk -n dev_printk_emit -F ./program_under_test
```

View results with `trace-cmd report`.

dcz's projects

Thoughts on software and society.

Atom feed